This model is similar to classic ideal-point models (e.g. Clinton et al 2004; Bafumi et al 2005) that measure “ideology” and latent space models of social network data (e.g. Hoff et al 2002) that measure “homophily”. It was first developed by Barberá (2015): “Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data” in Political Analysis.
In other words, the idea is to use the social network formed between citizens and political actors on Twitter as a source of information about their ideological positions.
The original model uses a huge adjacency matrix in which the rows are Twitter users who decide to follow a politician based on four parameters:
The “popularity” or “high profile” of the politician.
The “level of political interest” of the user (e.g. how many politicians is she willing to follow on Twitter).
And two parameters that are taken to be the ideological preferences of ordinary Twitter users and politicians.
His Stan program is as follows:
readLines("stan/barbera.stan") %>%
writeLines()
// Barberá (2015): ESTIMATION OF THE BAYESIAN SPATIAL FOLLOWING MODEL
data {
int<lower=1> J; // number of twitter users
int<lower=1> K; // number of elite twitter accounts
int<lower=1> N; // N = J x K
int<lower=1,upper=J> jj[N]; // twitter user for observation n
int<lower=1,upper=K> kk[N]; // elite account for observation n
int<lower=0,upper=1> y[N]; // dummy if user i follows elite j
}
parameters {
vector[K] alpha; // popularity parameters
vector[J] beta; // pol. interest parameters
vector[K] phi; // ideology of elite j
vector[J] theta; // ideology of user i
real mu_beta;
real<lower=0.1> sigma_beta;
real mu_phi;
real<lower=0.1> sigma_phi;
real<lower=0.1> sigma_alpha;
real gamma;
}
model {
alpha ~ normal(0, sigma_alpha);
beta ~ normal(mu_beta, sigma_beta);
phi ~ normal(mu_phi, sigma_phi);
theta ~ normal(0, 1);
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[kk[n]] + beta[jj[n]] -
gamma * square( theta[jj[n]] - phi[kk[n]] ) );
}
However, this original model takes around 20 hours to fit, and none of us was able to make it work on our computers. To get around this issue, we modify the original data in order to reduce the number of parameters that need to be estimated.
Here’s the problem: Barberá’s original dataset consists of an adjacency matrix with 301,537 rows (or regular Twitter users) and 318 columns (or political actors). Each cell \(y_{ij} \in \{1, 0\}\) indicates whether user \(i\) follows politician \(j\) or not.
load("data/barberá_replication_set/adj-matrix-US.rdata")
y@Dim
## [1] 301537 318
nnzero(y) / prod(y@Dim)
## [1] 0.02865904
This means the original model estimates 603,074 parameters for the users and 636 parameters for the politicians, from a very sparse matrix (approx 97% of all entries are zero).
Our approach is different. We take the following matrix projections:
\[ \begin{align} &\mathbf{Y^\top Y} = \text{politician } \times \text{ politician matrix} \\\\ &\mathbf{Y Y^\top} = \text{user } \times \text{ user matrix} \end{align} \]
The \(\mathbf A = \mathbf{Y^\top Y}\) projection is a symmetric \(318 \times 318\) matrix in which each diagonal entry corresponds to a politician’s (in-sample) total follower count. Note the following identity:
\[ \textsf{diag}(\mathbf{Y^\top Y}) = \textsf{colSums}(\mathbf Y) \]
Each off-diagonal cell \(w_{ij} = w_{ji}\) is now the number of mutual followers between politician \(i\) and politician \(j\). Similarly, each off-diagonal cell in the “user-user matrix” \(\mathbf{Y Y^\top}\) counts the politicians that two users follow in common, and its diagonal gives the number of politicians each user follows:
\[ \textsf{diag}(\mathbf{Y Y^\top}) = \textsf{rowSums}(\mathbf Y) \]
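These identities are easy to check on a toy binary matrix (the matrix below is made up for illustration):

```r
set.seed(1)
## Toy adjacency matrix: 6 users (rows) x 3 politicians (columns)
Y <- matrix(rbinom(18, 1, 0.5), nrow = 6, ncol = 3)

A_pol  <- t(Y) %*% Y   # politician x politician projection
A_user <- Y %*% t(Y)   # user x user projection

## Diagonals recover the marginal follow counts
all(diag(A_pol)  == colSums(Y))  # each politician's (in-sample) followers
all(diag(A_user) == rowSums(Y))  # politicians followed by each user

## Off-diagonal cells count mutual followers
A_pol[1, 2] == sum(Y[, 1] * Y[, 2])
```

The same identities hold for the full \(301{,}537 \times 318\) matrix.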
Throughout, we focus on the politician matrix, but note that the same model can be applied to the user matrix (with a different interpretation for the values in the diagonal).
Our initial model for the data is as follows:
\[ \begin{align} &w_{ij} \sim \textsf{binomial}(\min(n_i, n_j), p_{ij}) \\\\ &p_{ij} = \textsf{logit}^{-1}(\alpha_i + \alpha_j - \textsf{dist}(\theta_i, \theta_j)) \\\\ &\textsf{dist}(\theta_i, \theta_j) = | \theta_i - \theta_j| \end{align} \]
Because the matrix is symmetric, we have that \(w_{ij} = w_{ji}\) and that \(p_{ij} = p_{ji}\). We define \(u \to i\) to indicate that a regular Twitter user \(u\) follows politician \(i\). If we understand our own model correctly, then these conditional probabilities should be equal:
\[ \Pr(u \to i \mid u \to j, \alpha_i, \alpha_j, \theta_i, \theta_j) = \Pr(u \to j \mid u \to i, \alpha_i, \alpha_j, \theta_i, \theta_j) \]
This is reflected in the fact that the matrix is symmetric.
The \(\boldsymbol \alpha\) parameters represent each person’s “popularity” (or “high profile”), which should be roughly proportional to their number of followers (given by \(\mathbf A\)’s diagonal elements):
\[ \boldsymbol \alpha \propto \textsf{diag}(\mathbf A) \]
The \(\boldsymbol \theta\) parameters are ultimately what we’re interested in. They represent each person’s latent ideology: the closer two people are ideologically, the more mutual followers they will have (after adjusting for each one’s popularity). This is why we use the function \(\textsf{dist}(\theta_i, \theta_j)\), which is usually taken to be the Euclidean distance.
Finally, the \(n\) parameter in the binomial likelihood is taken to be the smaller of the two politicians’ total follower counts. This is because the number of mutual followers between \(i\) and \(j\) is logically bounded by the minimum of \(i\)’s and \(j\)’s total followers.
Also, if \(i = j\), then \(p_{ij} \approx 1\) and the expected number of mutual followers simply corresponds to the diagonal of the adjacency matrix. At least, this should be the case (this is one of the ways we propose to expand the model later on).
Special priors are needed in this model because, as it stands, it is unidentifiable. It has three sources of unidentifiability.
Additive aliasing in popularity: any constant \(k\) can be added to \(\alpha_i\) and then subtracted from \(\alpha_j\), leaving the predictions unchanged.
\[ \alpha_i + \alpha_j = (\alpha_i + k) + (\alpha_j -k) \]
Solution: constrain the \(\alpha\)’s to sum to zero.
Solution: set a prior on the \(\alpha\)’s, as follows:
\[ \boldsymbol \alpha \sim \textsf{normal}(0, 1) \]
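This aliasing is easy to verify numerically; a minimal sketch with made-up values:

```r
invlogit <- function(x) 1 / (1 + exp(-x))
alpha_i <- 0.8; alpha_j <- -0.3; dist_ij <- 1.2
k <- 5  # arbitrary constant
p_original <- invlogit(alpha_i + alpha_j - dist_ij)
p_shifted  <- invlogit((alpha_i + k) + (alpha_j - k) - dist_ij)
## The two probabilities agree (up to floating point), so the data
## cannot distinguish between the two parameterizations
abs(p_original - p_shifted) < 1e-12
```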
Additive aliasing in ideology: any constant \(k\) can be added to every \(\theta\), leaving the pairwise distances (and hence the predictions) unchanged.
\[ \begin{align} \textsf{dist}(\theta_{i}, \theta_{j}) &= \sqrt{(\theta_{id} - \theta_{jd})^2} \\\\ &= \sqrt{([\theta_{id} + k] - [\theta_{jd} + k])^2} \end{align} \]
Solution: constrain the \(\theta\)’s to sum to zero.
Solution: set a prior on the \(\theta\)’s, as follows:
\[ \boldsymbol \theta \sim \textsf{normal}(0, 1) \]
Reflection invariance in ideology: all \(\theta\)’s could be multiplied by \(-1\), leaving predictions unchanged:
\[ \textsf{dist}(\theta_{i}, \theta_{j}) = \textsf{dist}(-\theta_{i}, -\theta_{j}) \]
This will result in a bimodal likelihood and posterior distribution; and if we include both modes, then each parameter will have two maximum likelihood estimates and a posterior mean of \(0\).
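The reflection is easy to check directly: the pairwise distances (and hence the likelihood) are identical under \(\theta \mapsto -\theta\). A sketch with hypothetical ideal points:

```r
theta <- c(-1.5, -0.2, 0.7, 2.0)            # hypothetical ideal points
pairwise_dist <- function(th) abs(outer(th, th, "-"))
## Same distance matrix for theta and its mirror image
identical(pairwise_dist(theta), pairwise_dist(-theta))
```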
To solve reflection invariance in Stan, we can do the following:
Solution 1: restrict the sign of one of the \(\theta\)’s. For example, we get a renowned conservative politician and constrain his \(\theta\) to be positive; this will ensure the polarity of the scale is in a direction that’s both identifiable and easy to interpret.
parameters {
  ...
  vector[J-2] theta_free;            // ideology for the remaining J - 2 actors
  real<upper=0> renowned_left_person;
  real<lower=0> renowned_right_person;
}
transformed parameters {
  vector[J] theta;
  theta[location1] = renowned_left_person;
  theta[location2] = renowned_right_person;
  {                                  // fill the remaining positions
    int k = 1;
    for (j in 1:J) {
      if (j != location1 && j != location2) {
        theta[j] = theta_free[k];
        k = k + 1;
      }
    }
  }
}
Or maybe just use the following priors:
parameters {
...
real theta[J];
}
model {
theta[1] ~ normal(-1, 1);
theta[2] ~ normal(1, 1);
for (i in 3:J) {
theta[i] ~ normal(0, 1);
}
}
Solution 2: after estimating the model (with the bimodal distribution), multiply the sampled \(\theta\)’s by \(-1\) whenever they’re not in the desired direction. This can be done in the generated quantities block, which is evaluated after each iteration: if \(\theta_\text{right} < \theta_\text{left}\), then multiply \(\theta\) by \(-1\).
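The same flip can also be applied in R after sampling; a sketch (left and right are hypothetical column positions of a renowned left-wing and right-wing politician in the matrix of draws):

```r
## draws: iterations x K matrix of posterior draws of theta
flip_reflections <- function(draws, left, right) {
  mirrored <- draws[, right] < draws[, left]  # draws sitting in the mirrored mode
  draws[mirrored, ] <- -draws[mirrored, ]     # reflect them back
  draws
}

## Tiny example: the second draw is mirrored and gets flipped
d <- rbind(c(-1, 1), c(1, -1))
flip_reflections(d, left = 1, right = 2)
```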
Solution 3: pick initial values in a clever way (Barberá did this: he picked differing initial values for each party) and then pass them to the init argument of the stan() or sampling() functions.
Solution 4: an alternative is to put a different prior on each political party and take better advantage of the multilevel structure of the data. This is what Michael Betancourt ([here](http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html)) calls “breaking the labeling degeneracy with non-exchangeable priors”. A similar solution involves using Stan’s ordered type to impose an ordering on the parameters.
The total_followers each politician has are distributed with a very long tail.
The same happens, to a lesser extent, with the amount of mutual_followers.
A <- t(y) %*% y
## Warning: Removed 15 rows containing non-finite values (stat_bin).
The “Warning: Removed 15 rows containing non-finite values (stat_bin)” in the last graph indicates that 15 pairs of politicians have 0 mutual followers amongst themselves.
In order to fit the data into Stan, we put the data into “long form”, using the following code:
as_long_form <- function(M) {
N <- sum(upper.tri(M, diag = FALSE))
output <- matrix(NA, ncol = 3, nrow = N)
k <- 1
for (i in 1:(ncol(M) - 1)) {
for (j in (i + 1):ncol(M)) {
output[k, 1:2] <- c(i, j)
output[k, 3] <- M[i, j]
k <- k + 1
}
}
colnames(output) <- c("ii", "jj", "W")
return(output)
}
df <- as_long_form(A) %>% as_tibble()
df$name_ii <- colnames(A)[df$ii]
df$name_jj <- colnames(A)[df$jj]
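As an aside, the double loop in as_long_form() can be replaced by matrix indexing; a sketch (row order may differ from the loop version):

```r
as_long_form_vec <- function(M) {
  idx <- which(upper.tri(M), arr.ind = TRUE)  # all (i, j) pairs with i < j
  cbind(ii = idx[, "row"], jj = idx[, "col"], W = M[idx])
}
```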
The transformed adjacency matrix looks like this:
A[1:15, 1:6]
## BarackObama nytimes Schwarzenegger algore maddow FoxNews
## BarackObama 191986 96231 5577 29270 93220 30756
## nytimes 96231 119115 3797 19966 57325 27957
## Schwarzenegger 5577 3797 8041 2018 2417 2968
## algore 29270 19966 2018 32941 20226 5742
## maddow 93220 57325 2417 20226 109996 13837
## FoxNews 30756 27957 2968 5742 13837 77770
## MittRomney 52707 35970 4006 9480 21700 54776
## MMFlint 49493 29731 2009 12834 39015 6464
## JerryBrownGov 9704 6900 612 2800 7445 2693
## SarahPalinUSA 6606 4628 1169 1971 3438 11237
## current 12623 9309 414 3974 11218 2336
## glennbeck 14395 10927 1639 2457 7505 33333
## KarlRove 20625 16764 1885 4302 12248 33517
## KeithOlbermann 46562 28228 1551 11622 42803 7001
## RonPaul 14588 10105 1456 3436 7304 12014
And the resulting data frame looks like this:
df
## # A tibble: 50,403 x 5
## ii jj W name_ii name_jj
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 2 96231 BarackObama nytimes
## 2 1 3 5577 BarackObama Schwarzenegger
## 3 1 4 29270 BarackObama algore
## 4 1 5 93220 BarackObama maddow
## 5 1 6 30756 BarackObama FoxNews
## 6 1 7 52707 BarackObama MittRomney
## 7 1 8 49493 BarackObama MMFlint
## 8 1 9 9704 BarackObama JerryBrownGov
## 9 1 10 6606 BarackObama SarahPalinUSA
## 10 1 11 12623 BarackObama current
## # ... with 50,393 more rows
Simulate fake data and check that the model recovers the parameters.
Data generating process for 20 politicians (10 from each party) and 10,000 users.
K is the number of politicians
N \(= \binom{K}{2}\) is the total number of observations (assuming no missing data).
theta are the ideology parameters, assumed to come from bimodal distribution (i.e. we assume polarization); labels for each politician’s party are also provided.
total_followers are the elements in the diagonal; they come from a positive (folded) Student-\(t\) distribution, rounded to the nearest integer.
alpha is a scaled down version of \(\log\)(total_followers), plus some random noise.
simulate_data <- function(K) {
N <- choose(K, 2)
theta <- c(rnorm(K/2, 1.5, 1), rnorm(K/2, -1.5, 1))
total_followers <- round(abs(rt(K, df = 1)*1e4))
alpha <- as.vector(scale(log(total_followers))) + rnorm(K, sd = 0.2)
party <- c(rep("A", K/2), rep("B", K/2))
names(theta) <- party
theta <- sort(theta)
party <- names(theta)
invlogit <- function(x) 1 / (1 + exp(-x)) ## inverse logit
  output <- matrix(NA, ncol = K, nrow = K) ## generate adj. matrix
  for (i in 1:(ncol(output) - 1)) {
    for (j in (i + 1):ncol(output)) {
      p <- invlogit(alpha[i] + alpha[j] - abs(theta[i] - theta[j]))
      output[i, j] <- rbinom(1, size = min(total_followers[i], total_followers[j]),
                             prob = p)
      output[j, i] <- output[i, j]
    }
  }
  colnames(output) <- rownames(output) <- str_c("p", 1:K) ## label once, outside the loop
  diag(output) <- total_followers
return(list(
A = output,
pars = list(theta = theta, alpha = alpha),
party = party
))
}
And here are the results from one simulation (which look somewhat like the real data):
set.seed(123)
K <- 20
simulation <- simulate_data(K)
str(simulation)
## List of 3
## $ A : num [1:20, 1:20] 15415 5215 3098 1653 2650 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20] "p1" "p2" "p3" "p4" ...
## .. ..$ : chr [1:20] "p1" "p2" "p3" "p4" ...
## $ pars :List of 2
## ..$ theta: Named num [1:20] -3.47 -2.06 -1.97 -1.39 -1.14 ...
## .. ..- attr(*, "names")= chr [1:20] "B" "B" "B" "B" ...
## ..$ alpha: num [1:20] 0.264 0.48 0.124 0.125 0.477 ...
## $ party: chr [1:20] "B" "B" "B" "B" ...
Getting the data into long form, so they can be fed into Stan:
sim_long <- as_long_form(simulation$A) %>% as_tibble()
min_n <- vector("double", nrow(sim_long))
for (i in 1:nrow(sim_long)) {
min_n[[i]] <- min(diag(simulation$A)[sim_long$ii][i],
diag(simulation$A)[sim_long$jj][i])
}
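The loop above can also be collapsed with pmin(); a sketch:

```r
## Elementwise minimum of the two politicians' total follower counts
min_followers <- function(diag_A, ii, jj) pmin(diag_A[ii], diag_A[jj])

## e.g. min_followers(diag(simulation$A), sim_long$ii, sim_long$jj)
```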
simulation_data <- list(
N = choose(K, 2),
K = K,
ii = sim_long$ii,
jj = sim_long$jj,
W = sim_long$W,
min_n = min_n
)
Fitting the model to the fake data
WARNING: this model doesn’t deal with “reflection invariance” at all, so we expect it to perform poorly
readLines("stan/model1.stan") %>%
writeLines()
// Model 1, with reflection invariance and possibly other issues.
data {
int N; // number of observations
int K; // number of political actors
int<upper=K> ii[N]; // first actor for observation n
int<upper=K> jj[N]; // second actor for observation n
int<lower=0> W[N]; // number of mutual followers for observation n
//int<lower=0> D[K]; // diag(adj. mat.) or number of (in-sample) followers
int<lower=0> min_n[N];
}
// Make the transformed data thing work with min_n
parameters {
vector[K] theta;
vector[K] alpha;
}
model {
vector[N] eta = alpha[ii] + alpha[jj] - sqrt(square(theta[ii] - theta[jj]));
alpha ~ normal(0, 1);
theta ~ normal(0, 1); // should show reflection invariance
W ~ binomial_logit(min_n, eta);
}
model1 <- stan_model("stan/model1.stan")
fit_sim <- sampling(model1, data = simulation_data,
iter = 1000,
seed = 123)
saveRDS(fit_sim, "output/fit_sim.rds")
fit_sim <- readRDS("output/fit_sim.rds")
This model doesn’t fit… which is no surprise because it doesn’t consider “reflection invariance”. The Rhat values are terrible, which indicates that the chains are not exploring the same regions of parameter space.
In order to address this issue we take the simulated parameters of theta and choose one of the extreme values; we then nudge that parameter in a certain direction. In the real world, we should have sufficient judgement to do something similar without having access to the “true” values of theta.
The simulate_data() function already ensures that the thetas are sorted, so it’s safe to take the first one and assign it a prior.
Note. We also added a tau parameter in order to tweak how strong the prior is without needing to recompile the Stan program.
readLines("stan/model2.stan") %>%
writeLines()
// Model 2, with fix to reflection invariance
data {
int N; // number of observations
int K; // number of political actors
int<upper=K> ii[N]; // first actor for observation n
int<upper=K> jj[N]; // second actor for observation n
int<lower=0> W[N]; // number of mutual followers for observation n
int<lower=0> min_n[N];
real<lower=0> tau; // influence of the prior
}
parameters {
vector[K] theta;
vector[K] alpha;
}
model {
vector[N] eta = alpha[ii] + alpha[jj] - sqrt(square(theta[ii] - theta[jj]));
alpha ~ normal(0, 1);
theta[1] ~ normal(-1, tau);
for (k in 2:(K-1)) {
theta[k] ~ normal(0, 1);
}
theta[K] ~ normal(1, tau); // Please take note of the ordering of theta.
W ~ binomial_logit(min_n, eta);
}
model2 <- stan_model("stan/model2.stan")
simulation_data$tau <- 0.5
fit_sim2 <- sampling(model2, data = simulation_data,
iter = 1000, seed = 123)
saveRDS(fit_sim2, "output/fit_sim2.rds")
fit_sim2 <- readRDS("output/fit_sim2.rds")
The chains are still not mixing well. The following plot shows the estimates of some parameters, compared with the true values from the simulation (dashed line):
This new version of model 2 imposes harsher constraints on the first value of theta (making it always negative).
readLines("stan/model2b.stan") %>%
writeLines()
// Model 2, with harsher fix to reflection invariance
data {
int N; // number of observations
int K; // number of political actors
int<upper=K> ii[N]; // first actor for observation n
int<upper=K> jj[N]; // second actor for observation n
int<lower=0> W[N]; // number of mutual followers for observation n
int<lower=0> min_n[N];
real<lower=0> tau; // influence of the prior
}
// Make the transformed data thing work with min_n
parameters {
real<upper=0> theta_1; // "ideology" for first in list (truncated)
vector[K-1] theta_rest; // "ideology" for k - 1
vector[K] alpha; // "popularity" for k
}
model {
vector[K] theta = append_row(theta_1, theta_rest);
vector[N] eta = alpha[ii] + alpha[jj] - sqrt(square(theta[ii] - theta[jj]));
alpha ~ normal(0, 1);
theta_1 ~ normal(-1.5, tau); // These two lines hopefully fix
theta_rest ~ normal(0, 1); // the "reflection invariance" issue
W ~ binomial_logit(min_n, eta);
}
model2b <- stan_model("stan/model2b.stan")
fit_sim2b <- sampling(model2b, data = simulation_data,
iter = 1000, seed = 123)
saveRDS(fit_sim2b, "output/fit_sim2b.rds")
fit_sim2b <- readRDS("output/fit_sim2b.rds")
What’s going on here? Probably, when the person we nudge to be on the left isn’t far enough on the left, some chains might get stuck on the zero value, and this doesn’t provide information for the estimation of the other parameters. We thought that by truncating \(\theta_1\) at zero we were placing more density exactly at zero, but this doesn’t seem to be the case.
expose_stan_functions("stan/normal_trunc_dgp.stan")
sim_y <- replicate(1e4, normal_ub_rng(-1, 1, 0))
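For comparison, the same upper-truncated normal can be sampled in plain R via the inverse-CDF method, which makes it easy to inspect how much mass piles up near the bound (a sketch; we assume normal_ub_rng draws from a normal(-1, 1) truncated above at 0):

```r
## Draw from normal(mu, sigma) truncated above at ub, via the inverse CDF
rnorm_ub <- function(n, mu, sigma, ub) {
  u <- runif(n, 0, pnorm(ub, mu, sigma))
  qnorm(u, mu, sigma)
}

set.seed(1)
sim_y_r <- rnorm_ub(1e4, mu = -1, sigma = 1, ub = 0)
## All draws respect the bound; mass accumulates near zero,
## but there is no point mass exactly at zero
```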
The init trick
First try
This model takes the initial values for theta from a uniform distribution from -1 to 1, following roughly the order of the true values of theta.
theta_inits <- rep(list(list(theta = sort(runif(K, -1, 1)))), 4)
fit_sim2_init <- sampling(model2, data = simulation_data,
iter = 1000, seed = 123,
init = theta_inits,
control = list(max_treedepth = 15))
saveRDS(fit_sim2_init, "output/fit_sim2_init.rds")
fit_sim2_init <- readRDS("output/fit_sim2_init.rds")
Second try
Okay, so this “pick your best initial value” strategy works well, but at the same time it seems too ad hoc. Furthermore, in the real world we don’t know the true values of \(\boldsymbol \theta\), so there’s no way to actually use this.
The next fit follows Barberá’s lead and takes initial values in accordance to each politician’s party:
theta_inits2 <- rep(list(list(
theta = rnorm(K, mean = ifelse(simulation$party == "B", -1, 1)))),
4)
fit_sim2_initb <- sampling(model2, data = simulation_data,
iter = 1000, seed = 123,
init = theta_inits2,
control = list(max_treedepth = 15))
saveRDS(fit_sim2_initb, "output/fit_sim2_initb.rds")
fit_sim2_initb <- readRDS("output/fit_sim2_initb.rds")
Third try
Finally, the next initialization (for model2b) only takes an initial value for a “renowned extreme value” person in the dataset:
theta_inits3 <- replicate(4, list(list(theta_1 = rnorm(1, -1, 0.2))))
fit_sim2b_init <- sampling(model2b, data = simulation_data,
iter = 1000, seed = 123,
init = theta_inits3,
control = list(max_treedepth = 15))
saveRDS(fit_sim2b_init, "output/fit_sim2b_init.rds")
fit_sim2b_init <- readRDS("output/fit_sim2b_init.rds")
Assume popularity is known and measured by \(\textsf{diag}(\mathbf A)\)
We already know that the popularity parameters should be roughly proportional to each politician’s total follower count. Thus, it’s reasonable to include this additional information in the model.
\[ \boldsymbol \alpha \propto \textsf{diag}(\mathbf A) \]
A reasonable solution is as follows:
\[ \begin{align} &x_i = \textsf{diag}(\mathbf A)_i \\\\ &z_i = \frac{x_i - \bar x}{\textsf{sd}(x)} \\\\ &w_{ij} \sim \textsf{binomial}(\min(n_i, n_j), p_{ij}) \\\\ &p_{ij} = \textsf{logit}^{-1}(\beta(z_i + z_j) - \textsf{dist}(\theta_i, \theta_j)) \end{align} \]
In words: we replace the \(\boldsymbol \alpha\) parameters with the standardized total follower counts and add a single \(\beta\) parameter. The reason for standardizing is that plugging in the raw \(x\) values would force \(\beta\) to be extremely small, which could cause underflow in the Stan program. Note that this reduces the number of parameters Stan has to estimate by roughly half.
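The standardization corresponds to the following R computation (a sketch with made-up follower counts):

```r
invlogit <- function(x) 1 / (1 + exp(-x))
D <- c(150000, 80000, 5000, 30000)   # hypothetical diag(A) values
z <- (D - mean(D)) / sd(D)           # standardized popularity scores
beta <- 1; theta <- c(-1, -0.5, 0.5, 1)
## Probability of a mutual follower for the (1, 2) dyad
p12 <- invlogit(beta * (z[1] + z[2]) - abs(theta[1] - theta[2]))
```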
Also note the new transformed data block in this model, which should make things easier for the rest of models.
readLines("stan/model3.stan") %>%
writeLines()
// Model 3, with total follower data
data {
int N; // number of observations
int K; // number of political actors
int<upper=K> ii[N]; // first actor for observation n
int<upper=K> jj[N]; // second actor for observation n
int<lower=0> W[N]; // number of mutual followers for observation n
int<lower=0> D[K]; // diag(adj. mat.) or number of (in-sample) followers
real<lower=0> tau;
}
transformed data {
vector[K] z;
int min_n[N];
z = (to_vector(D) - mean(to_vector(D))) / sd(to_vector(D)); // vectorized
for (n in 1:N) {
min_n[n] = min(D[ii[n]], D[jj[n]]); // not vectorized
}
}
parameters {
vector[K] theta; // "ideology" for k
real beta;
real alpha;
}
model {
vector[N] eta = alpha + beta*(z[ii] + z[jj]) - sqrt(square(theta[ii]-theta[jj]));
theta[1] ~ normal(-2, tau)T[ , 0];
for (k in 2:(K-1)) {
theta[k] ~ normal(0, 1);
}
theta[K] ~ normal(2, tau)T[0, ];
beta ~ normal(0, 5);
alpha ~ normal(0, 5);
W ~ binomial_logit(min_n, eta);
}
simulation_data2 <- list(
N = choose(K, 2),
K = K,
ii = sim_long$ii,
jj = sim_long$jj,
W = sim_long$W,
D = diag(simulation$A),
tau = 0.5
)
model3 <- stan_model("stan/model3.stan")
fit_sim3 <- sampling(model3, data = simulation_data2,
seed = 123, iter = 1000,
control = list(max_treedepth = 15,
adapt_delta = 0.9)
)
saveRDS(fit_sim3, "output/fit_sim3.rds")
fit_sim3 <- readRDS("output/fit_sim3.rds")
This model didn’t fit well.
In the next fit we do the same thing, but use clever initialization values.
fit_sim3b <- sampling(model3, data = simulation_data2,
seed = 123, iter = 2000,
control = list(max_treedepth = 15,
adapt_delta = 0.9),
init = theta_inits2
)
saveRDS(fit_sim3b, "output/fit_sim3b.rds")
fit_sim3b <- readRDS("output/fit_sim3b.rds")
print(fit_sim3b, pars = "theta", probs = c(0.25, 0.5, 0.75))
Inference for Stan model: model3.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 25% 50% 75% n_eff Rhat
theta[1] -3.71 0.02 0.20 -3.84 -3.71 -3.58 96 1.06
theta[2] -2.04 0.03 0.20 -2.16 -2.03 -1.90 40 1.09
theta[3] -2.27 0.04 0.20 -2.40 -2.26 -2.13 24 1.11
theta[4] -1.59 0.02 0.20 -1.71 -1.58 -1.46 124 1.06
theta[5] -1.21 0.02 0.20 -1.34 -1.21 -1.08 120 1.06
theta[6] -2.52 0.22 0.36 -2.78 -2.62 -2.34 3 2.04
theta[7] -1.20 0.02 0.20 -1.32 -1.20 -1.07 127 1.05
theta[8] -0.96 0.02 0.20 -1.09 -0.96 -0.83 127 1.05
theta[9] -0.59 0.02 0.20 -0.72 -0.59 -0.46 128 1.05
theta[10] -0.03 0.07 0.22 -0.17 -0.03 0.11 9 1.20
theta[11] 0.10 0.02 0.19 -0.02 0.10 0.25 146 1.04
theta[12] 0.97 0.38 0.58 0.59 0.74 1.03 2 2.65
theta[13] 1.30 0.02 0.20 1.17 1.30 1.43 90 1.07
theta[14] 0.81 0.02 0.20 0.69 0.82 0.94 131 1.05
theta[15] 2.96 0.15 0.28 2.79 3.00 3.16 4 1.54
theta[16] 1.62 0.02 0.20 1.50 1.63 1.75 127 1.05
theta[17] 0.78 0.02 0.20 0.66 0.79 0.91 132 1.05
theta[18] 3.16 0.41 0.61 2.76 2.91 3.36 2 3.45
theta[19] 2.30 0.02 0.19 2.17 2.30 2.43 149 1.04
theta[20] 3.33 0.07 0.22 3.20 3.34 3.48 9 1.19
Samples were drawn using NUTS(diag_e) at Sun Dec 9 18:17:13 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
This one fits much better, but it’s still not good!
Model 3b
This one doesn’t standardize the data, but rather takes the log of total followers. Here, the intercept coefficient should be negative.
\[ \begin{align} &x_i = \textsf{diag}(\mathbf A)_i \\\\ &w_{ij} \sim \textsf{binomial}(\min(n_i, n_j), p_{ij}) \\\\ &p_{ij} = \textsf{logit}^{-1}(\alpha + \beta\log(x_i + x_j) - \textsf{dist}(\theta_i, \theta_j)) \end{align} \]
readLines("stan/model3b.stan") %>%
writeLines()
// Model 3B, with LOG total follower data
data {
int N; // number of observations
int K; // number of political actors
int<upper=K> ii[N]; // first actor for observation n
int<upper=K> jj[N]; // second actor for observation n
int<lower=0> W[N]; // number of mutual followers for observation n
int<lower=0> D[K]; // diag(adj. mat.) or number of (in-sample) followers
real<lower=0> tau;
}
transformed data {
int min_n[N];
for (n in 1:N) {
min_n[n] = min(D[ii[n]], D[jj[n]]); // not vectorized
}
}
parameters {
vector[K] theta; // "ideology" for k
real beta;
real alpha;
}
transformed parameters {
vector[N] eta;
eta = alpha + beta*(log(to_vector(D[ii]) + to_vector(D[jj]))) -
sqrt(square(theta[ii]-theta[jj]));
}
model {
theta[1] ~ normal(-2, tau);
for (k in 2:(K-1)) {
theta[k] ~ normal(0, 1);
}
theta[K] ~ normal(2, tau);
beta ~ normal(0, 5);
alpha ~ normal(0, 5);
W ~ binomial_logit(min_n, eta);
}
generated quantities {
vector[N] y;
for (n in 1:N) {
y[n] = binomial_rng(min_n[n], inv_logit(eta[n]));
}
}
model3b <- stan_model("stan/model3b.stan")
fit_sim3log <- sampling(model3b, data = simulation_data2,
seed = 123, iter = 1000,
control = list(max_treedepth = 15,
adapt_delta = 0.95),
init = theta_inits2
)
saveRDS(fit_sim3log, "output/fit_sim3log.rds")
fit_sim3log <- readRDS("output/fit_sim3log.rds")
print(fit_sim3log, pars = "theta", probs = c(0.25, 0.5, 0.75))
Inference for Stan model: model3b.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 25% 50% 75% n_eff Rhat
theta[1] -3.69 0.02 0.20 -3.83 -3.71 -3.53 139 1.04
theta[2] -2.11 0.02 0.20 -2.25 -2.13 -1.95 138 1.04
theta[3] -2.20 0.02 0.20 -2.33 -2.21 -2.03 139 1.04
theta[4] -1.54 0.02 0.20 -1.68 -1.56 -1.38 138 1.04
theta[5] -1.24 0.02 0.20 -1.38 -1.26 -1.08 138 1.04
theta[6] -2.42 0.02 0.20 -2.57 -2.44 -2.26 139 1.04
theta[7] -1.21 0.02 0.20 -1.35 -1.23 -1.05 138 1.04
theta[8] -0.93 0.02 0.20 -1.07 -0.95 -0.77 138 1.04
theta[9] -0.49 0.02 0.20 -0.63 -0.51 -0.33 138 1.04
theta[10] 0.09 0.02 0.20 -0.05 0.07 0.25 138 1.04
theta[11] 0.22 0.02 0.20 0.08 0.20 0.38 138 1.04
theta[12] 2.39 0.17 0.31 2.13 2.40 2.61 3 1.69
theta[13] 1.43 0.02 0.20 1.29 1.41 1.59 140 1.04
theta[14] 0.88 0.02 0.20 0.74 0.86 1.04 138 1.04
theta[15] 1.99 0.02 0.20 1.85 1.97 2.15 140 1.04
theta[16] 1.64 0.02 0.20 1.50 1.62 1.80 139 1.04
theta[17] 1.07 0.02 0.20 0.93 1.05 1.23 138 1.04
theta[18] 3.28 0.14 0.28 3.09 3.29 3.51 4 1.50
theta[19] 2.46 0.02 0.20 2.32 2.44 2.62 134 1.05
theta[20] 2.98 0.02 0.20 2.83 2.99 3.14 100 1.06
Samples were drawn using NUTS(diag_e) at Sun Dec 9 18:23:23 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
In this final model, we enforce an ordering of the “ideology population means”, such that \(\mu_A < \mu_B\). This follows Betancourt’s solution to identifiability in the section of his case study titled “Breaking the Labeling Degeneracy by Enforcing an Ordering”.
Note that we’re not only ordering the population means; we’re also giving them asymmetric priors, and we give asymmetric priors to two of the \(\theta_i\) parameters as well.
pindex <- ifelse(simulation$party == "B", 1, 2)
simulation_data2$pindex <- pindex
readLines("stan/model4.stan") %>%
writeLines()
// Model 4, with two means
data {
int N; // number of observations
int K; // number of political actors
int<upper=K> ii[N]; // first actor for observation n
int<upper=K> jj[N]; // second actor for observation n
int<lower=0> W[N]; // number of mutual followers for observation n
int<lower=0> D[K]; // diag(adj. mat.) or number of (in-sample) followers
int<lower=0> pindex[K];
real<lower=0> tau;
}
transformed data {
int min_n[N];
for (n in 1:N) { // not vectorized
min_n[n] = min(D[ii[n]], D[jj[n]]);
}
}
parameters {
vector[K] theta;
real beta;
real alpha;
ordered[2] mu;
}
transformed parameters {
vector[N] eta;
eta = alpha + beta*(log(to_vector(D[ii]) + to_vector(D[jj]))) -
sqrt(square(theta[ii]-theta[jj]));
}
model {
mu[1] ~ normal(-1, 1);
mu[2] ~ normal(1, 1);
theta[1] ~ normal(-2, tau);
for (k in 2:(K-1)) {
theta[k] ~ normal(mu[pindex[k]], 1);
}
theta[K] ~ normal(2, tau);
beta ~ normal(0, 5);
alpha ~ normal(0, 5);
W ~ binomial_logit(min_n, eta);
}
generated quantities {
vector[N] y;
for (n in 1:N) {
y[n] = binomial_rng(min_n[n], inv_logit(eta[n]));
}
}
In math:
\[ \begin{align} &x_i = \textsf{diag}(\mathbf A)_i \\\\ &w_{ij} \sim \textsf{binomial}(\min(n_i, n_j), p_{ij}) \\\\ &p_{ij} = \textsf{logit}^{-1}(\alpha + \beta\log(x_i + x_j) - \textsf{dist}(\theta_i, \theta_j)) \end{align} \]
model4 <- stan_model("stan/model4.stan")
fit_sim4 <- sampling(model4, data = simulation_data2,
seed = 123, iter = 2000,
control = list(max_treedepth = 15,
adapt_delta = 0.95),
init = theta_inits2)
saveRDS(fit_sim4, "output/fit_sim4.rds")
fit_sim4 <- readRDS("output/fit_sim4.rds")
Overall, model 3 and model 4 did better than the rest, which is not surprising given that they have the most bells and whistles (including a clever initialization strategy for the \(\theta\)s). The difference between them is that model 4 includes information on party identification, and so in theory should perform better.
Assessing parameter recovery
draws <- extract(fit_sim4)$theta
colnames(draws) <- paste0("theta[", 1:20, "]")
Roughly half of the 50% intervals (and 95% of the 95% intervals) should contain the true parameter values, given that we have simulated the data from the model we are fitting. However, keep in mind that because we only have 20 ideology parameters, the observed percentages might be a bit off.
between <- function(x, min, max) {
return(x > min & x < max)
}
interval_95 <- apply(draws, MARGIN = 2, quantile,
prob = c(0.025, 0.975))
between(simulation$pars$theta, min = interval_95[1, ],
max = interval_95[2, ]) %>%
mean()
## [1] 0.8
interval_50 <- apply(draws, MARGIN = 2, quantile,
prob = c(0.25, 0.75))
between(simulation$pars$theta, min = interval_50[1, ],
max = interval_50[2, ]) %>%
mean()
## [1] 0.6
So \(\frac{16}{20}\) parameters are contained in the 95% intervals, and \(\frac{12}{20}\) parameters are contained in the 50% intervals. The following plot shows a more complete assessment of parameter recovery.
Note that some of the intervals in the above graph are the product of chains that didn’t mix very well, particularly for theta[6].
What are the strengths and weaknesses of the model? How might the model be expanded?
Overall, we found that this type of model is extremely fragile and that it’s not easy to predict when it will fit and when it won’t. This is a major drawback. To solve this, we were forced to make many “clever” decisions which make reproducibility suspect.
Please note that all previous four models are already expansions of the original model.
In what follows, we propose four ways to expand the model:
A different distance function. It’s very common for social network models to work with the Euclidean distance:
\[ \text{(Euclidean) }\textsf{dist}(\theta_{i}, \theta_{j}) = \left[\sum_d (\theta_{id} - \theta_{jd})^2 \right]^{\frac{1}{2}} \]
Here, \(d\) indexes the dimensions we could add, but because we only consider one dimension (\(d = 1\)), the sum drops out. However, we don’t necessarily have to use this particular distance: any other distance metric that obeys the triangle inequality will do, possibly composed with some monotone function \(f\).
\[ \textsf{dist}(\theta_{i}, \theta_{j}) = f\left(\left[\sum_d (\theta_{id} - \theta_{jd})^2 \right]^{\frac{1}{2}}\right) \]
In fact, Barberá already uses a different distance:
\[ \text{(Barberá's) } \textsf{dist}(\theta_i, \theta_j) = \gamma (\theta_i - \theta_j)^2 \]
Footnote: Why did he do this? Probably because the abs and fabs functions in Stan “can seriously hinder sampling and optimization efficiency for gradient-based methods (e.g., NUTS, HMC, BFGS) if applied to parameters (including transformed parameters and local variables in the transformed parameters or model block). The problem is that they break gradients due to discontinuities coupled with zero gradients elsewhere. They do not hinder sampling when used in the data, transformed data, or generated quantities blocks” (Stan Manual, p. 436).
The reason for wanting to change the distance function is that the distance term never increases the probability inside the binomial likelihood: \(\textsf{dist}(\theta_i, \theta_j)\) only takes away from \(p\) when the distance is large, but it never adds to it.
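To see the practical difference between the two penalties, here is a toy comparison on the logit scale (the \(\theta\) values are made up and \(\gamma = 1\) is an arbitrary assumption, not something estimated from the data):

```r
# Toy comparison of the two distance penalties (gamma = 1 is an
# arbitrary assumption; these theta values are made up)
theta_i <- c(-1.0, -0.1, 0.0, 0.5)
theta_j <- c( 1.0,  0.1, 0.0, 0.4)

dist_squared   <- (theta_i - theta_j)^2  # Barberá's penalty
dist_euclidean <- abs(theta_i - theta_j) # one-dimensional Euclidean distance

# The squared penalty grows faster for distant pairs but is nearly
# flat close to zero, where the Euclidean distance still discriminates
rbind(dist_squared, dist_euclidean)
```

Note how the squared penalty compresses small gaps (0.2 becomes 0.04) while amplifying large ones (2 becomes 4), which is exactly the near-zero insensitivity described above.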
Adjust for geography. Presumably, the “following behavior” of Twitter users follows a geographical pattern (i.e. the closer these politicians are to your home, the more likely you are to follow them, even if you don't like them at all). If geographic distance is not orthogonal to ideology, then the ideal point estimates might be biased in systematic ways. Thus we can add an extra indicator variable \(g\) as follows:
\[ \begin{align} &p_{ij} = \textsf{logit}^{-1}(\alpha + \psi g_{ij} + \beta\log(x_i + x_j) - \textsf{dist}(\theta_i, \theta_j)) \\\\ &g_{ij} = \begin{cases} 1 &\text{if } i \text{ and } j \text{ are in same region} \\ 0 &\text{otherwise} \end{cases} \end{align} \]
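As a hypothetical sketch of what constructing \(g_{ij}\) could look like for the long-form data, assuming we had a region lookup per account (the region assignments below are made up for illustration):

```r
# Hypothetical same-region indicator g_ij for long-form dyad data.
# The region assignments here are made up for illustration.
region <- c(SenSanders  = "Northeast", marcorubio = "South",
            BarackObama = "Midwest",   algore     = "South")

df_geo <- data.frame(name_ii = c("SenSanders", "marcorubio"),
                     name_jj = c("BarackObama", "algore"),
                     stringsAsFactors = FALSE)
df_geo$g <- as.integer(region[df_geo$name_ii] == region[df_geo$name_jj])
df_geo  # g = 1 only for the same-region dyad
```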
Treat \(n\) as a random variable. All elements in \(\textsf{diag}(\mathbf A)\) (i.e. the \(n\) in the binomial likelihood) could be treated as random variables. The reason for this is simple:
The \(n\) values come from a sample of Twitter users. Barberá removed many users located outside the USA, as well as others who don't seem to have been active in the last couple of years. In other words, we could have observed a different \(n\).
Even if we changed the \(n\) values to include all followers, the number of followers today (and thus, also the number of mutual followers) might not be the same tomorrow.
We are thinking of a model like this:
\[ \begin{align} &w_{ij} \sim \textsf{poisson}(\lambda_{ij}) \\\\ &\lambda_{ij} = \exp(\alpha + \beta\log(x_i + x_j) - \textsf{dist}(\theta_i, \theta_j)) \\\\ &x_i = \textsf{diag}(\mathbf A)_i \end{align} \]
Or maybe like this:
\[ \begin{align} &\lambda_{ij} \sim \textsf{mvnormal}(0, \boldsymbol \Omega) \\\\ &\rho_{ij} \text{ is modeled as a function of } \textsf{diag}(\mathbf A) \text{ and } \textsf{dist}(\theta_i, \theta_j) \\\\ \end{align} \]
Or perhaps it would make more sense to model \(w_{ij}\) as if it came from a negative binomial distribution, where \(\textsf{dist}(\theta_i, \theta_j)\) affects the mean, and \(\textsf{diag}(\mathbf A)\) affects the overdispersion of the data.
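To make the Poisson variant concrete, here is a small simulation under made-up parameter values (alpha, beta, and the follower counts are arbitrary assumptions, not estimates from the replication data):

```r
set.seed(123)
# Toy simulation of the Poisson variant; alpha, beta, and the follower
# counts x are arbitrary assumptions, not estimates from the data
K     <- 5
theta <- rnorm(K)        # ideology for each account
x     <- rpois(K, 1000)  # stand-in for diag(A), the follower counts
alpha <- -8
beta  <- 1

pairs <- t(combn(K, 2))  # every i < j dyad
log_lambda <- alpha + beta * log(x[pairs[, 1]] + x[pairs[, 2]]) -
  (theta[pairs[, 1]] - theta[pairs[, 2]])^2
w <- rpois(nrow(pairs), exp(log_lambda))  # simulated mutual followers
length(w)  # one simulated count per dyad: choose(5, 2) = 10
```

As in the real data, each dyad gets one mutual-follower count, and pairs with larger ideological gaps draw from a smaller Poisson mean.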
Fit the model to the real data and perform model checking and/or validation (Chapters 6 and 7 of BDA).
Note. None of the models actually fits the real data.
The data
load("data/barberá_replication_set/adj-matrix-US.rdata")
load("data/barberá_replication_set/elites-data.Rdata")
us <- elites.data[["US"]]; rm(elites.data)
Getting the data into “long form” (with Bernie Sanders being the first and Marco Rubio being the last on the list):
A <- t(y) %*% y
name_index <- colnames(y)
s <- which(name_index %in% c("marcorubio", "SenSanders"))
A <- A[c("SenSanders", name_index[-s], "marcorubio"),
c("SenSanders", name_index[-s], "marcorubio")]
as_long_form <- function(M) {
# convert the upper triangle of a symmetric matrix M
# into one (ii, jj, W) row per dyad
N <- sum(upper.tri(M, diag = FALSE))
output <- matrix(NA, ncol = 3, nrow = N)
k <- 1
for (i in 1:(ncol(M) - 1)) {
for (j in (i + 1):ncol(M)) {
output[k, 1:2] <- c(i, j)
output[k, 3] <- M[i, j]
k <- k + 1
}
}
colnames(output) <- c("ii", "jj", "W")
return(output)
}
df <- as_long_form(A) %>% as_tibble()
df$name_ii <- colnames(A)[df$ii]
df$name_jj <- colnames(A)[df$jj]
stan_data <- list(
N = nrow(df),
K = ncol(A),
ii = df$ii,
jj = df$jj,
W = df$W,
D = diag(A),
tau = 0.5
)
df
## # A tibble: 50,403 x 5
## ii jj W name_ii name_jj
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 2 23477 SenSanders BarackObama
## 2 1 3 14075 SenSanders nytimes
## 3 1 4 602 SenSanders Schwarzenegger
## 4 1 5 7173 SenSanders algore
## 5 1 6 22520 SenSanders maddow
## 6 1 7 2755 SenSanders FoxNews
## 7 1 8 5951 SenSanders MittRomney
## 8 1 9 16543 SenSanders MMFlint
## 9 1 10 3379 SenSanders JerryBrownGov
## 10 1 11 924 SenSanders SarahPalinUSA
## # ... with 50,393 more rows
df_senators <- data_frame(screen_name = colnames(A),
total_followers = diag(A)) %>%
left_join(us %>% filter(!duplicated(screen_name))) %>%
select(screen_name, total_followers, location, party)
## Joining, by = "screen_name"
df_senators <- df_senators %>%
mutate(init = case_when(
screen_name == "SenSanders" ~ -1,
party == "R" ~ 1,
party == "D" ~ -1,
TRUE ~ 0
))
theta_initial <- rep(list(list(
theta = rnorm(318, df_senators$init, 0.05))), 4)
pindex <- case_when(
df_senators$party == "D" ~ 1,
df_senators$party == "R" ~ 3,
TRUE ~ 2
)
stan_data$pindex <- pindex
readLines("stan/model3b.stan") %>%
writeLines()
// Model 3B, with LOG total follower data
data {
int N; // number of observations
int K; // number of political actors
int<upper=K> ii[N]; // first actor for observation n
int<upper=K> jj[N]; // second actor for observation n
int<lower=0> W[N]; // number of mutual followers for observation n
int<lower=0> D[K]; // diag(adj. mat.) or number of (in-sample) followers
real<lower=0> tau;
}
transformed data {
int min_n[N];
for (n in 1:N) {
min_n[n] = min(D[ii[n]], D[jj[n]]); // not vectorized
}
}
parameters {
vector[K] theta; // "ideology" for k
real beta;
real alpha;
}
transformed parameters {
vector[N] eta;
eta = alpha + beta*(log(to_vector(D[ii]) + to_vector(D[jj]))) -
sqrt(square(theta[ii]-theta[jj]));
}
model {
theta[1] ~ normal(-2, tau);
for (k in 2:(K-1)) {
theta[k] ~ normal(0, 1);
}
theta[K] ~ normal(2, tau);
beta ~ normal(0, 5);
alpha ~ normal(0, 5);
W ~ binomial_logit(min_n, eta);
}
generated quantities {
vector[N] y;
for (n in 1:N) {
y[n] = binomial_rng(min_n[n], inv_logit(eta[n]));
}
}
model <- stan_model("stan/model3b.stan")
## change iter and warmup in final fit
fit <- sampling(model, data = stan_data,
seed = 123, iter = 500, warmup = 250,
control = list(max_treedepth = 15,
adapt_delta = 0.95),
init = theta_initial)
saveRDS(fit, "output/fit.rds")
fit <- readRDS("output/fit.rds")
The model doesn’t actually fit this data, as shown by the Rhats and the n_effs:
readLines("stan/model4b.stan") %>%
writeLines()
## // Model 4B, with three means (R, D, other)
## data {
## int N; // number of observations
## int K; // number of political actors
## int<upper=K> ii[N]; // first actor for observation n
## int<upper=K> jj[N]; // second actor for observation n
## int<lower=0> W[N]; // number of mutual followers for observation n
## int<lower=0> D[K]; // diag(adj. mat.) or number of (in-sample) followers
## int<lower=0> pindex[K];
## real<lower=0> tau;
## }
## transformed data {
## int min_n[N];
## for (n in 1:N) { // not vectorized
## min_n[n] = min(D[ii[n]], D[jj[n]]);
## }
## }
## parameters {
## vector[K] theta;
## real beta;
## real alpha;
## ordered[3] mu;
## }
## transformed parameters {
## vector[N] eta;
## eta = alpha + beta*(log(to_vector(D[ii]) + to_vector(D[jj]))) -
## sqrt(square(theta[ii]-theta[jj]));
## }
## model {
## mu[1] ~ normal(-1, tau);
## mu[2] ~ normal(0, tau);
## mu[3] ~ normal(1, tau);
## theta[1] ~ normal(-2, tau);
## for (k in 2:(K-1)) {
## theta[k] ~ normal(mu[pindex[k]], 1);
## }
## theta[K] ~ normal(2, tau);
## beta ~ normal(0, 5);
## alpha ~ normal(0, 5);
##
## W ~ binomial_logit(min_n, eta);
## }
## generated quantities {
## vector[N] y;
## for (n in 1:N) {
## y[n] = binomial_rng(min_n[n], inv_logit(eta[n]));
## }
## }
model2 <- stan_model("stan/model4b.stan")
## change iter and warmup in final fit
fit2 <- sampling(model2, data = stan_data,
seed = 123, iter = 1000,
control = list(max_treedepth = 15,
adapt_delta = 0.95),
init = theta_initial)
saveRDS(fit2, "output/fit2.rds")
fit2 <- readRDS("output/fit2.rds")
How long did this take to fit?
get_elapsed_time(fit2)
## warmup sample
## chain:1 156.340 4466.680
## chain:2 134.102 758.852
## chain:3 256.706 30939.400
## chain:4 156.072 12339.500
The init trick
Give each politician an initialization value drawn from a uniform distribution between -1 and 1, ordered according to their first principal component.
pca <- prcomp(A, scale = TRUE, center = TRUE)
init_df <- data_frame(init_pca = sort(runif(318, -1, 1), FALSE),
pc1 = sort(pca$rotation[ , 1], TRUE),
screen_name = names(sort(pca$rotation[ , 1]*-1)))
df_senators <- df_senators %>%
full_join(init_df)
## Joining, by = "screen_name"
theta_initial_pca <- rep(list(list(
theta = df_senators$init_pca)), 4)
fit2pca <- sampling(model2, data = stan_data,
seed = 123, iter = 500,
control = list(max_treedepth = 15,
adapt_delta = 0.95),
init = theta_initial_pca)
saveRDS(fit2pca, "output/fit2pca.rds")
fit2pca <- readRDS("output/fit2pca.rds")